Querying your data

Once your data is read in and available as a DataFrame, Pandas provides a whole suite of tools for extracting information from it.

Let’s start by looking at some example data which contains information about the amounts that people at a restaurant paid and tipped for their meals:

import pandas as pd

tips = pd.read_csv("./data/tips.csv")
tips

	total_bill	tip	day	time	size
0	16.99	0.71	Sun	Dinner	2
1	10.34	1.16	Sun	Dinner	3
2	21.01	2.45	Sun	Dinner	3
3	23.68	2.32	Sun	Dinner	2
4	24.59	2.53	Sun	Dinner	4
...	...	...	...	...	...
239	29.03	4.14	Sat	Dinner	3
240	27.18	1.40	Sat	Dinner	2
241	22.67	1.40	Sat	Dinner	2
242	17.82	1.22	Sat	Dinner	2
243	18.78	2.10	Thur	Dinner	2

244 rows × 5 columns

The first thing that you usually want to be able to do is to pull apart the overall table to get at specific bits of data from inside.

When using lists and dicts in Python, the square-bracket syntax was used to fetch an item from the container. In Pandas we can use the same syntax but it’s a much more powerful tool.

If you pass a single string to the square brackets of a DataFrame it will return to you just that one column:

tips["total_bill"]

0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64

Accessing a column like this returns an object called a Series which is the second of the two main Pandas data types. Don’t worry too much about these just yet but think of them as being a single column of the DataFrame, along with the index of the DataFrame.

If you pass a list of column names to the square brackets then you can grab out just those columns:

tips[["total_bill", "tip"]]

	total_bill	tip
0	16.99	0.71
1	10.34	1.16
2	21.01	2.45
3	23.68	2.32
4	24.59	2.53
...	...	...
239	29.03	4.14
240	27.18	1.40
241	22.67	1.40
242	17.82	1.22
243	18.78	2.10

244 rows × 2 columns

Aside: Indexing

Note that the outer square brackets are saying “I’m selecting data” and the inner square brackets are saying “I’m giving you a list of column names”.

In this case it gives you back another DataFrame, just with only the required columns present.

Getting rows

If you want to select a row from a DataFrame then you can use the .loc (short for “location”) attribute which allows you to pass index values like:

tips.loc[2]

total_bill     21.01
tip             2.45
day              Sun
time          Dinner
size               3
Name: 2, dtype: object

If you want to grab a single value from the table, you can follow the row label with the column name that you want:

tips.loc[2, "total_bill"]

21.01

Exercise

The size column in the data is the number of people in the dining party. Extract this column from the DataFrame.

Answer

import pandas as pd

tips = pd.read_csv("./data/tips.csv")

tips["size"]

0      2
1      3
2      3
3      2
4      4
      ..
239    3
240    2
241    2
242    2
243    2
Name: size, Length: 244, dtype: int64

Descriptive statistics

Now that we know how to refer to individual columns, we can start asking questions about the data therein. If you’ve worked with columns of data in Excel for example, you’ve probably come across the SUM() and AVERAGE() functions to summarise data. We can do the same thing in pandas by calling the sum() or mean() methods on a column:

tips["total_bill"].sum()

4827.77

tips["total_bill"].mean()

19.78594262295082

You can see a list of all the possible functions you can call in the documentation for Series. So for example, you can also ask for the maximum value from a column with the max() method.

tips["tip"].max()

7.0

In some situations, you don’t just want to get the value of the maximum, but rather to find out which row it came from. In cases like that there is the idxmax() method which give you the index label of the row with the maximum:

tips["total_bill"].idxmax()

So we know that the value of the maximum bill was £7 and it was found in the row with the label 170.

You can then use this information with the .loc attribute to get the rest of the information for that row:

index_of_max_bill = tips["total_bill"].idxmax()
tips.loc[index_of_max_bill]

total_bill     50.81
tip              7.0
day              Sat
time          Dinner
size               3
Name: 170, dtype: object

Exercise

Find the value of the tip that was paid for the smallest total bill.

Hint: Have a look at the documentation page for Series. There’s a function which works like idxmax() but finds the minimum.

Answer

import pandas as pd

tips = pd.read_csv("./data/tips.csv")

index_of_smallest_bill = tips["total_bill"].idxmin()
tips["tip"][index_of_smallest_bill]

0.7

Acting on columns

Functions like sum() and max() summarise down the column to a single value. In some situations we instead want to manipulate a column to create a new column.

For example, the data in the table is in British pounds. If we wanted to convert it into the number of pennies then we need to multiply each value by 100. In Pandas you can refer to an entire column and perform mathematical operations on it and it will apply the operation to each row:

tips["total_bill"] * 100

0      1699.0
1      1034.0
2      2101.0
3      2368.0
4      2459.0
        ...  
239    2903.0
240    2718.0
241    2267.0
242    1782.0
243    1878.0
Name: total_bill, Length: 244, dtype: float64

The data in row 0 was previously 16.99 but the result here is 1699.0, and likewise for every other row.

You can do any mathematical operation that Python supports, such as +, - and /.

Combining columns

Aside

Columns are actually combined by matching together their index labels, not strictly by their position in the column.

As well as operating on individual columns, you can combine together multiple columns. Any operation you do between two columns will be done row-wise, that is adding two columns will add together the two values from the first row of each, then the second row from each etc.

For example if we wanted to find out, for each entry in our table what the ratio between tip amount and total bill was, we could divide one column by the other:

tips["tip"] / tips["total_bill"]

0      0.041789
1      0.112186
2      0.116611
3      0.097973
4      0.102887
         ...   
239    0.142611
240    0.051508
241    0.061756
242    0.068462
243    0.111821
Length: 244, dtype: float64

Of course, if we want the tip percentage so we need to multiply the value by 100:

(tips["tip"] / tips["total_bill"])*100

0       4.178929
1      11.218569
2      11.661114
3       9.797297
4      10.288735
         ...    
239    14.261109
240     5.150846
241     6.175562
242     6.846240
243    11.182109
Length: 244, dtype: float64

It can get messy and hard-to-read doing too many things on one line, so it’s a good idea to split each part of your calculation onto its own line, giving each step its own variable name along the way.

tip_fraction = tips["tip"] / tips["total_bill"]
tip_percent = tip_fraction*100
tip_percent

0       4.178929
1      11.218569
2      11.661114
3       9.797297
4      10.288735
         ...    
239    14.261109
240     5.150846
241     6.175562
242     6.846240
243    11.182109
Length: 244, dtype: float64

Exercise

The total_bill column give the total amount for the entire dining party. Calculate the amount spent per person for each row in the DataFrame.

Extra: calculate the average and the standard deviation of this data. You might need to take a look at the documentation page for the Series type.

Answer

import pandas as pd

tips = pd.read_csv("./data/tips.csv")

bill_per_person = tips["total_bill"] / tips["size"]
bill_per_person

0       8.495000
1       3.446667
2       7.003333
3      11.840000
4       6.147500
         ...    
239     9.676667
240    13.590000
241    11.335000
242     8.910000
243     9.390000
Length: 244, dtype: float64

bill_per_person.mean()

7.888229508196722

bill_per_person.std()

2.9143496626221

Adding new columns

New columns can be added to a DataFrame by assigning them by index (as you would for a Python dict):

tips["percent_tip"] = (tips["tip"] / tips["total_bill"])*100
tips

	total_bill	tip	day	time	size	percent_tip
0	16.99	0.71	Sun	Dinner	2	4.178929
1	10.34	1.16	Sun	Dinner	3	11.218569
2	21.01	2.45	Sun	Dinner	3	11.661114
3	23.68	2.32	Sun	Dinner	2	9.797297
4	24.59	2.53	Sun	Dinner	4	10.288735
...	...	...	...	...	...	...
239	29.03	4.14	Sat	Dinner	3	14.261109
240	27.18	1.40	Sat	Dinner	2	5.150846
241	22.67	1.40	Sat	Dinner	2	6.175562
242	17.82	1.22	Sat	Dinner	2	6.846240
243	18.78	2.10	Thur	Dinner	2	11.182109

244 rows × 6 columns

Exercise

Take the “bill per person” result you calculated in the last exercise and add it as a new column, bill_per_person, in the DataFrame.

Answer

import pandas as pd

tips = pd.read_csv("./data/tips.csv")

tips["bill_per_person"] = tips["total_bill"] / tips["size"]
tips

	total_bill	tip	day	time	size	bill_per_person
0	16.99	0.71	Sun	Dinner	2	8.495000
1	10.34	1.16	Sun	Dinner	3	3.446667
2	21.01	2.45	Sun	Dinner	3	7.003333
3	23.68	2.32	Sun	Dinner	2	11.840000
4	24.59	2.53	Sun	Dinner	4	6.147500
...	...	...	...	...	...	...
239	29.03	4.14	Sat	Dinner	3	9.676667
240	27.18	1.40	Sat	Dinner	2	13.590000
241	22.67	1.40	Sat	Dinner	2	11.335000
242	17.82	1.22	Sat	Dinner	2	8.910000
243	18.78	2.10	Thur	Dinner	2	9.390000

244 rows × 6 columns